Skip to content

ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API #8088

Closed
kszucs wants to merge 80 commits into
apache:masterfrom
kszucs:py2ar
Closed

ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API #8088
kszucs wants to merge 80 commits into
apache:masterfrom
kszucs:py2ar

Conversation

@kszucs

@kszucs kszucs commented Aug 31, 2020

Copy link
Copy Markdown
Member

Targets of the refactoring:

  • PythonToArrow converters based on a common API
  • PyBytesView to use Result return values and contain is_utf8 flag
  • PyConversionOptions is now available from all converters so we can honor its flags

Fixes

  • ARROW-9993 [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects
  • ARROW-9994 [C++][Python] Auto chunking nested array containing binary-like fields result malformed output
  • ARROW-9996 [C++] Dictionary is unset when calling DictionaryArray.GetScalar for null values
  • ARROW-9997 [Python] StructScalar.as_py() fails if the type has duplicate field names
  • ARROW-9999 [Python] Support constructing dictionary array directly through pa.array()
  • ARROW-10000 [C++][Python] Support constructing StructArray from list of key-value pairs
  • ARROW-9593 [Python] Add custom pickle reducers for DictionaryScalar
  • ARROW-6281 [Python] Produce chunked arrays for nested types in pyarrow.array
  • ARROW-2367 [Python] ListArray has trouble with sizes greater than kMaximumCapacity
  • ARROW-9976: [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

Backward incompatibility

Since a struct type can contain duplicated field names we cannot return a struct scalar as a mapping, so I had to change the .as_py() representation to return with a list of key-value pairs.

TODOs:

  • ensure that the large memory tests are passing
  • benchmark and check binary size again

Library size

Before:

 12M Sep 25 15:05 libarrow.200.0.0.dylib
2.7M Sep 25 15:07 libarrow_python.200.0.0.dylib

After:

 12M Sep 25 15:46 libarrow.200.0.0.dylib
2.1M Sep 25 15:50 libarrow_python.200.0.0.dylib

Benchmarks

Executed the following ASV benchmark:

asv continuous --bench convert_builtins master py2ar --no-only-changed --split

After some optimization:

Benchmarks that have improved:

       before           after         ratio
     [f358a29b]       [18d1c052]
     <master>         <py2ar>
-     2.78±0.03ms      2.45±0.03ms     0.88  convert_builtins.ConvertPyListToArray.time_convert('bool')
-     3.59±0.01ms      3.12±0.02ms     0.87  convert_builtins.ConvertPyListToArray.time_convert('int32')
-     3.37±0.01ms      2.73±0.01ms     0.81  convert_builtins.ConvertPyListToArray.time_convert('uint32')
-     3.74±0.02ms      3.03±0.01ms     0.81  convert_builtins.ConvertPyListToArray.time_convert('int64')
-     3.38±0.01ms      2.69±0.01ms     0.80  convert_builtins.ConvertPyListToArray.time_convert('uint64')
-     2.83±0.01ms      2.24±0.01ms     0.79  convert_builtins.ConvertPyListToArray.time_convert('float32')
-     3.92±0.02ms      2.99±0.02ms     0.76  convert_builtins.ConvertPyListToArray.time_convert('binary10')
-     14.1±0.04ms      8.89±0.05ms     0.63  convert_builtins.ConvertPyListToArray.time_convert('unicode')
-     5.60±0.01ms      3.24±0.03ms     0.58  convert_builtins.ConvertPyListToArray.time_convert('ascii')
-     5.37±0.02ms      2.91±0.04ms     0.54  convert_builtins.ConvertPyListToArray.time_convert('binary')

Benchmarks that have stayed the same:

       before           after         ratio
     [f358a29b]       [18d1c052]
     <master>         <py2ar>
      14.8±0.02ms       15.5±0.1ms     1.05  convert_builtins.ConvertPyListToArray.time_convert('decimal')
       16.4±0.7ms       15.1±0.6ms     0.92  convert_builtins.ConvertPyListToArray.time_convert('struct from tuples')
       34.4±0.3ms       31.5±0.4ms     0.92  convert_builtins.ConvertPyListToArray.time_convert('int64 list')
       16.7±0.7ms       15.1±0.6ms    ~0.91  convert_builtins.ConvertPyListToArray.time_convert('struct')
      2.42±0.02ms      2.05±0.03ms    ~0.85  convert_builtins.ConvertPyListToArray.time_convert('float64')

@wesm

wesm commented Sep 1, 2020

Copy link
Copy Markdown
Member

Are there some related JIRAs that will be fixed by this?

@kszucs

kszucs commented Sep 1, 2020

Copy link
Copy Markdown
Member Author

Not directly, but this will help to solve multiple ones. I'm going to create one.

@kszucs

kszucs commented Sep 1, 2020

Copy link
Copy Markdown
Member Author

Just executed the ASV benchmarks on this PR and the master branch with the two sample sizes and the results showed slighty better performance in both cases:

==================== ========= =========
        type            PR      master 
-------------------- --------- ---------
       int32          136±0ms   138±0ms
       uint32         129±0ms   133±0ms
       int64          131±0ms   138±0ms
       uint64         131±0ms   137±0ms
      float32         131±0ms   131±0ms
      float64         130±0ms   132±0ms
        bool          128±0ms   131±0ms
      decimal         232±0ms   227±0ms
       binary         149±0ms   180±0ms
      binary10        141±0ms   151±0ms
       ascii          158±0ms   160±0ms
      unicode         300±0ms   319±0ms
     int64 list       310±0ms   307±0ms
       struct         224±0ms   241±0ms
 struct from tuples   231±0ms   230±0ms
==================== ========= =========

Also compared the libarrow_python binary size:
master: 2.7M
this PR: 2.2M

So far there is no performance degradation - if we can trust the ASV benchmarks.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why separate PyArrayConverter and ArrayConverter?

Comment thread cpp/src/arrow/array/builder_base.h Outdated
Comment thread cpp/src/arrow/ipc/metadata_internal.cc Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
@kszucs

kszucs commented Sep 1, 2020

Copy link
Copy Markdown
Member Author

Why separate PyArrayConverter and ArrayConverter?

My hope is that the classes in converter.h can be reused in the future to implement different converter APIs, as an example it could be used in the R bindings as well. It would be even better if we could centralize converter agnostic logic here like auto chunking.

@kszucs kszucs changed the title [C++][Python] Refactor python to arrow conversions based on a reusable conversion API [WIP] [C++][Python] Refactor python to arrow conversions based on a reusable conversion API Sep 1, 2020
@bkietz bkietz self-requested a review September 2, 2020 15:53

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a neat cleanup of the conversion code, and potentially reusable as well. Thanks for doing this! I have a few comments on the overall structure below

Comment thread cpp/src/arrow/python/python_test.cc Outdated
Comment thread cpp/src/arrow/util/hashing.h
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
@kszucs

kszucs commented Sep 8, 2020

Copy link
Copy Markdown
Member Author

@romainfrancois since the conversion code in python_to_arrow.cc and r/src/array_from_vector.cpp are pretty similar it would be nice to reuse as much code as we could (including new features like auto chunking large inputs).

Could you please take a look at the base classes in converter.h to see whether would it help/improve the R side implementation? I have not documented it yet because it's still under discussion but you can take a look how I'm using it in the python bindings.

@kszucs kszucs changed the title [C++][Python] Refactor python to arrow conversions based on a reusable conversion API ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API Sep 14, 2020
@apache apache deleted a comment from github-actions Bot Sep 14, 2020
@kszucs

kszucs commented Sep 15, 2020

Copy link
Copy Markdown
Member Author

@github-actions crossbow submit test-spark

@github-actions

Copy link
Copy Markdown

Revision: 79e8229d6828f7f63f80db75a4ef89af9deb03d3

Submitted crossbow builds: ursa-labs/crossbow @ actions-541

Task Status
test-conda-python-3.7-spark-branch-3.0 Github Actions
test-conda-python-3.8-spark-master Github Actions

@kszucs

kszucs commented Sep 15, 2020

Copy link
Copy Markdown
Member Author

@ursabot build

Comment thread python/pyarrow/scalar.pxi Outdated

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chunker is neat, thanks for working on this

My main concern here is overcomplexity of the Converter<> interface, since it currently bears responsibilities both as the abstract base class for conversion and as a trait mapping from DataType subclasses to concrete converters (used to enable generic construction).

I think Converter<>, Chunker<> and the mixins PrimitiveConverter<> etc can remain in util/converter.h but costruction logic (MakeConverterImpl) should be moved back to python_to_arrow.h. Fully generic construction is definitely a worthwhile goal but I also think it's worth doing so in a dedicated follow up. Establishing the base classes for generic conversion is already a great step forward

Comment thread cpp/src/arrow/array/builder_binary.h Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.h Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.h Outdated
Comment thread cpp/src/arrow/scalar_test.cc Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h
Comment thread cpp/src/arrow/util/converter.h
Comment thread cpp/src/arrow/util/converter.h Outdated
@kszucs kszucs force-pushed the py2ar branch 2 times, most recently from 33c5aa7 to f3be748 Compare September 19, 2020 10:18

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking much better, just a few more simplifications to construction logic and I think this will be ready to go

Comment thread python/pyarrow/array.pxi Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The converter's Reserve() might fail due to overflow, but this shouldn't be an error for Chunker::Reserve since an overflow can be averted by finishing the chunk. See for example the logic surrounding ChunkedBinaryBuilder::extra_capacity_. This doesn't need to be handled in this PR since Reserve is a performance hint; for now I think we can just ignore errors emitted by this call to Reserve

Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments:

Comment thread cpp/src/arrow/util/converter.h Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated
Comment thread cpp/src/arrow/util/converter.h Outdated

@jorisvandenbossche jorisvandenbossche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(only briefly looked at the python code + tests, impressive set of changes ;-))

Comment thread python/pyarrow/tests/test_types.py Outdated
Comment thread python/pyarrow/array.pxi Outdated
Comment thread python/pyarrow/array.pxi Outdated
Comment thread python/pyarrow/scalar.pxi Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't raise on "d" being present here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It tolerates extra items (similarly when converting from dicts). I can restrict it if you think that would be more appropiate. We may want to alter this behavior with specific conversion flags.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the order matters when constructing from pairs (based on the test below), I would personally expect additional elements would also not be allowed (BTW, what actually happens if the additional element is not in the last position? Maybe add a test for that as well?).
While when constructing from dicts also the order does not matter I suppose? So this could be more tolerant

@kszucs kszucs Sep 22, 2020

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will raise since the key field equality is checked explicitly. It also supports converting less elements than the number of fields.

Comment thread python/pyarrow/tests/test_convert_builtin.py Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants